Extraction of linguistic resources from multilingual corpora and their exploitation

Author

  • Ahmad R. Shahid
Abstract

The increasing availability of on-line and off-line multilingual resources, along with developments in the automatic tools that can process this information, such as GIZA++ (Och & Ney 2003), has made it possible to build new multilingual resources that can be used for NLP/IR tasks. Lexicon generation is one such task; done by hand, it is expensive in both human and capital costs. Generation of multilingual lexicons can now be automated, as is done in this research work. Wikipedia [1], an on-line multilingual resource, was employed to automatically build multilingual lexicons using simple search strategies. The Europarl parallel corpus (Koehn 2002) was used to create multilingual sets of synonyms, which were later used to carry out Word Sense Disambiguation (WSD) on the original corpus from which they were derived. The theoretical analysis of the methodology validated our approach. The multilingual sets of synonyms were then used to learn unsupervised mod...

[1] http://www.wikipedia.org/
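
As a rough illustration of the kind of resource the abstract describes, the Python sketch below groups word alignments into a small multilingual lexicon. It is not the author's method: the alignment tuples, the `build_lexicon` helper, and the toy vocabulary are all assumptions made for illustration, standing in for whatever a word aligner would produce from a parallel corpus.

```python
from collections import defaultdict

# Hypothetical word alignments: (source_lang, source_word, target_lang, target_word).
# These toy tuples are an assumption; real ones would come from an aligner
# run over a parallel corpus.
ALIGNMENTS = [
    ("en", "bank", "fr", "banque"),
    ("en", "bank", "de", "Bank"),
    ("en", "bank", "fr", "rive"),
    ("en", "bank", "de", "Ufer"),
    ("en", "house", "fr", "maison"),
    ("en", "house", "de", "Haus"),
]


def build_lexicon(alignments):
    """Map each source word to the translations seen for it, per target language."""
    lexicon = defaultdict(lambda: defaultdict(set))
    for _src_lang, src_word, tgt_lang, tgt_word in alignments:
        lexicon[src_word][tgt_lang].add(tgt_word)
    # Convert nested defaultdicts/sets to plain dicts with sorted lists.
    return {
        word: {lang: sorted(words) for lang, words in langs.items()}
        for word, langs in lexicon.items()
    }


lexicon = build_lexicon(ALIGNMENTS)
print(lexicon["bank"])
```

In this toy data an ambiguous source word such as "bank" collects several translations per language, which hints at why multilingual translation sets can carry sense information; how the original work actually exploits them is not shown here.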

Similar resources

Exploiting a Multilingual Web-based Encyclopedia for Bilingual Terminology Extraction

Multilingual linguistic resources are usually constructed from parallel corpora, but since these corpora are available only for selected text domains and language pairs, the potential of other resources is being explored as well. This article seeks to explore and to exploit the idea of using multilingual web-based encyclopedias such as Wikipedia as comparable corpora for bilingual terminology e...

Shared Resources for Multilingual Information Extraction and Challenges in Named Entity Annotation

Progress in natural language processing requires increasing amounts of data and annotation in a growing variety of languages, and research in named entity extraction is no exception. While the value of richly annotated, large-scale multilingual corpora is undeniable, costs for producing such data are high, underscoring the value of shared resources. As part of the US Government-sponsored Automati...

Parallel Corpora, Alignment Technologies and Further Prospects in Multilingual Resources and Technology Infrastructure

Multilingual technologies, which to a large extent are language independent, provide powerful support for building annotated linguistic resources for languages where such resources are scarce or missing. All these technologies require parallel corpora in order to achieve their ends. Parallel texts encode extremely valuable linguistic knowledge because the linguistic decisions made b...

Preparation and exploitation of bilingual texts

A bitext is a merged document composed of two versions of a given text, usually in two different languages. An aligned bitext is produced by an alignment tool, or aligner, that automatically aligns or matches the versions of the same text, generally sentence by sentence. A multilingual aligned corpus, or collection of aligned bitexts, when consulted with a search tool, can be extremely useful for...

Multilingual Resources for Entity Extraction

Progress in human language technology requires increasing amounts of data and annotation in a growing variety of languages. Research in Named Entity extraction is no exception. The Linguistic Data Consortium is creating annotated corpora to support information extraction in English, Chinese, Arabic, and other languages for a variety of US Government-sponsored programs. This paper covers the scope of...

Publication date: 2012